Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…/lighteval into nathan-add-tests-for-metrics
Pull Request Overview
This pull request adds an automated testing framework for LightEval metrics to ensure their reliability and correctness. The automated testing system allows developers to define test cases with input/output pairs in JSON files, which are then automatically validated against metric implementations.
- Creates an automated testing framework for metrics with JSON-based test case definitions
- Moves existing unit tests from the tests/tasks folder to tests/unit/tasks for better organization
- Fixes broken metrics by correcting function signatures and implementing missing dependencies
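The JSON-driven mechanism described above can be sketched roughly as follows. The field names and file shape here are illustrative assumptions, not the PR's exact schema:

```python
import json

# Hypothetical test-case shape; the real schema in the PR may differ.
test_case_json = """
{
    "metric_class": "exact_match",
    "metric_params": {},
    "input": {"prediction": "Paris", "gold": "Paris"},
    "expected_output": 1.0
}
"""

def exact_match(prediction, gold):
    # Stand-in for a real lighteval metric implementation.
    return 1.0 if prediction.strip() == gold.strip() else 0.0

# Registry mapping metric names (as used in JSON files) to implementations.
METRICS = {"exact_match": exact_match}

def run_test_case(raw):
    """Load one JSON test case, run the metric, and compare to the expected output."""
    case = json.loads(raw)
    metric = METRICS[case["metric_class"]]
    result = metric(**case["input"])
    assert result == case["expected_output"], f"{result} != {case['expected_output']}"
    return result

run_test_case(test_case_json)
```

A framework like the one in this PR would discover such JSON files in a test_cases directory and generate one pytest case per file.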
Reviewed Changes
Copilot reviewed 57 out of 82 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/unit/tasks/test_registry.py | Updates custom task import paths to reflect new unit test folder structure |
| tests/unit/metrics/test_metrics_automated.py | Core automated testing framework implementation with test case models and execution logic |
| tests/unit/metrics/test_automated_metrics_pytest.py | Pytest integration for the automated testing framework |
| tests/unit/metrics/pytest.ini | Pytest configuration for metric testing |
| tests/unit/metrics/test_cases/*.json | JSON test case files for various metrics (stored in Git LFS) |
| src/lighteval/metrics/metrics_sample.py | Fixes broken metric implementations including function signatures and NLTK dependencies |
| src/lighteval/metrics/metrics_corpus.py | Adds NLTK download and fixes F1 score calculation |
| src/lighteval/metrics/metrics.py | Corrects F1 score averaging parameter |
| src/lighteval/metrics/imports/summac.py | Removes deprecated tokenizer parameter |
| src/lighteval/tasks/extended/lcb/main.py | Adds missing batched_compute parameter |
| pyproject.toml | Updates test dependencies |
clefourrier
left a comment
Missing a doc section in Adding a new metric to explain how to create a test file. A bunch of nits.
But overall, cool new mechanism, with clearer logic! GG
@@ -1 +1,2 @@
 *.json filter=lfs diff=lfs merge=lfs -text
+tests/unit/metrics/test_cases/*.json -filter -diff -merge text
do not use git-lfs for json files in the test_cases dir
Pull Request Overview
Copilot reviewed 59 out of 84 changed files in this pull request and generated 4 comments.
if metric_params != {}:
    metric = self.METRIC_CLASSES[metric_class].value
    metric_enum_value = copy.deepcopy(metric)(metric_params)
else:
    metric_enum_value = self.METRIC_CLASSES[metric_class].value
Line 126 should access the metric enum, not the value. It should be metric = self.METRIC_CLASSES[metric_class] without .value, then call metric.value on line 127.
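The suggested fix might look like the following self-contained sketch. `METRIC_CLASSES` and `DummyMetric` are illustrative stand-ins for the real lighteval enum and metric types, not the PR's actual code:

```python
import copy
from enum import Enum

class DummyMetric:
    """Hypothetical metric factory standing in for a real lighteval metric."""
    def __call__(self, params):
        # Mimics metric(metric_params): returns a configured metric object.
        self.params = params
        return self

class METRIC_CLASSES(Enum):
    # Illustrative member; the real enum holds the library's metric objects.
    exact_match = DummyMetric()

metric_class = "exact_match"
metric_params = {"strip": True}

if metric_params != {}:
    # Access the enum member first (no .value here)...
    metric = METRIC_CLASSES[metric_class]
    # ...then deep-copy its value before parameterizing, so the shared
    # enum member is never mutated.
    metric_enum_value = copy.deepcopy(metric.value)(metric_params)
else:
    metric_enum_value = METRIC_CLASSES[metric_class].value
```

Deep-copying the member's value (rather than the member itself) matters because `copy.deepcopy` on an `Enum` member returns the same singleton, so the original code risks configuring the shared metric object in place.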
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Adds a mechanism to auto-test metrics: when creating a metric, you now create a JSON file with test cases (input, output, and expected results).
- Moves unit tests to a tests/unit folder.
- Fixes broken metrics.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>